Nutch: an Open-Source Platform for Web Search

Author

  • Doug Cutting
Abstract

Nutch is an open-source project providing both complete Web search software and a platform for the development of novel Web search methods. Nutch is built on a distributed storage and computing foundation, such that every operation scales to very large collections. Core algorithms crawl, parse and index Web-based data. Plugins extend functionality at various points, including network protocols, document formats, indexing schemas and query operators.
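As a rough illustration of the plugin idea described in the abstract, the following minimal sketch registers handlers under named extension points and runs a tiny crawl, parse and index loop over them. It is not Nutch's actual API: the extension-point names ("protocol", "parser"), the `register` decorator, the in-memory `FAKE_WEB`, and the `mem://` scheme are all assumptions made up for this example, and nothing here is distributed.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical plugin registry: extension point -> key -> handler.
# These names are illustrative only and do not mirror Nutch's real interfaces.
PLUGINS: Dict[str, Dict[str, Callable]] = defaultdict(dict)


def register(point: str, key: str):
    """Decorator that registers a function under an extension point."""
    def wrap(fn: Callable) -> Callable:
        PLUGINS[point][key] = fn
        return fn
    return wrap


# Stand-in for the network: URL -> (content type, raw body).
FAKE_WEB: Dict[str, Tuple[str, str]] = {
    "mem://a": ("text/plain", "open source web search"),
    "mem://b": ("text/upper", "WEB CRAWL AND INDEX"),
}


@register("protocol", "mem")
def fetch_mem(url: str) -> Tuple[str, str]:
    # Protocol plugin: fetches a document for the made-up "mem" scheme.
    return FAKE_WEB[url]


@register("parser", "text/plain")
def parse_plain(raw: str) -> List[str]:
    # Parser plugin for one document format: whitespace tokenization.
    return raw.split()


@register("parser", "text/upper")
def parse_upper(raw: str) -> List[str]:
    # Parser plugin for a second, made-up format: lowercase, then tokenize.
    return raw.lower().split()


def crawl_parse_index(urls: List[str]) -> Dict[str, Set[str]]:
    """Fetch each URL, parse it, and build a term -> URLs inverted index."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for url in urls:
        scheme = url.split("://", 1)[0]
        content_type, raw = PLUGINS["protocol"][scheme](url)
        for term in PLUGINS["parser"][content_type](raw):
            index[term].add(url)
    return index


index = crawl_parse_index(["mem://a", "mem://b"])
```

In this sketch, supporting a new protocol or document format means registering one more function; the core loop stays unchanged, which is the kind of separation the abstract attributes to plugins.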


Similar resources

Implementation of MapReduce Algorithm and Nutch Distributed File System in Nutch

This paper provides an in-depth description of the MapReduce algorithm and the Nutch Distributed File System in the Nutch web search engine. Nutch is an open-source Web search engine that can be used at global, local, and even personal scale. Engineering a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. The...


Nutch: A Flexible and Scalable Open-Source Web Search Engine

Nutch is an open-source Web search engine that can be used at global, local, and even personal scale. Its initial design goal was to enable a transparent alternative for global Web search in the public interest — one of its signature features is the ability to “explain” its result rankings. Recent work has emphasized how it can also be used for intranets; by local communities with richer data m...


Full Text Search of Web Archive Collections

The Internet Archive, in cooperation with the International Internet Preservation Consortium, is developing an open-source full-text search of Web archive collections. Web archive collection search presents the usual set of technical difficulties of searching large collections of documents. It also introduces new challenges that are often at odds with typical search engine usage. This paper outlines the ch...


Design and Implementation of Agricultural Production and Market Information Matching Recommendation System in the Cloud Environment

At present in China, decentralized production by small farmers does not keep pace with the development requirements of the agricultural products market. This paper proposes a new approach that uses cloud computing technology to design and implement an agricultural production and marketing information matching recommendation system in the cloud computing environment. The platform collects agricultural mark...


TREC Dynamic Domain: Polar Science

This paper outlines the creation of the Polar dataset within the TREC-Dynamic Domain track. The techniques used to create the Polar dataset fall into two basic categories: information extraction using Apache Tika and information retrieval using Apache Nutch. First, we expanded the parsing capabilities of Apache Tika, an open source framework for text and metadata extraction, to provide more sea...




Publication date: 2005